System and method for I/O error recovery

ABSTRACT

A system and method for recovering from an I/O error in a distributed object-based storage system that includes a plurality of object storage devices for storing object components, a manager coupled to each of the object storage devices, wherein the object storage devices coordinate with the file manager, and one or more clients that access and store distributed, object-based files on the object storage devices. A client attempts to perform an operation selected from the group consisting of: a data read operation from an object storage device, a data write operation to an object storage device, a set attribute operation to an object storage device, a get attribute operation from an object storage device and a create object operation to an object storage device. Upon failure of the operation, the client sends a message from the client to the manager that includes information representing a description of the failure.

FIELD OF THE INVENTION

The present invention generally relates to data storage methodologies, and, more particularly, to systems and methods for recovery from I/O errors in distributed object-based storage systems in which a client implements RAID algorithms.

BACKGROUND OF THE INVENTION

With increasing reliance on electronic means of data communication, different models to efficiently and economically store a large amount of data have been proposed. A data storage mechanism requires not only a sufficient amount of physical disk space to store data, but various levels of fault tolerance or redundancy (depending on how critical the data is) to preserve data integrity in the event of one or more disk failures.

In a traditional RAID networked storage system, a data storage device, such as a hard disk, is connected to a RAID controller and associated with a particular server or a particular server having a particular backup server. Thus, access to the data storage device is available only through the server associated with that data storage device. A client processor desiring access to the data storage device would, therefore, access the associated server through the network and the server would access the data storage device as requested by the client. In such systems, RAID recovery is performed in a manner that is transparent to the file system client.

By contrast, in a distributed object-based data storage system that uses RAID, each object-based storage device communicates directly with clients over a network. An example of a distributed object-based storage system is shown in co-pending, commonly-owned, U.S. patent application Ser. No. 10/109,998, filed on Mar. 29, 2002, titled “Data File Migration from a Mirrored RAID to a Non-Mirrored XOR-Based RAID Without Rewriting the Data,” incorporated by reference herein in its entirety.

In many failure scenarios in a distributed object-based file system, the failure can only be correctly diagnosed and corrected by a system manager that knows about and can control system specific devices. For example, a failure can be caused by a malfunctioning object-storage device and the ability to reset such device is reserved for security reasons only to the system manager unit. Therefore, when a client fails to write to a set of objects, the client needs to report that failure to the system manager so that the failure can be diagnosed and corrective actions can be taken. In addition, the file system manager must take steps to repair the object's parity equation.

In instances where a client fails to write to a set of objects, it would be desirable if the role of the system manager was not limited to repairing the error condition, but also extended to repair of the affected file system object's parity equation. Expansion of the role of the system manager to include correction of the parity equation is advantageous because the system will no longer need to depend on the file system client that encountered a failure to be able to repair the object's parity equation. The present invention provides an improved system and method that, in instances where there is an I/O error, transmits information to the system manager sufficient to permit the system manager to repair the parity equation of the object associated with the I/O error.

SUMMARY OF THE INVENTION

The present invention is directed to recovering from an I/O error in a distributed object-based storage system that includes a plurality of object storage devices for storing object components, a manager coupled to each of the object storage devices, wherein the object storage devices coordinate with the file manager, and one or more clients that access and store distributed, object-based files on the object storage devices.

In one embodiment of the present invention, a client attempts to perform an operation on data that is the subject of the operation, the operation being selected from the group consisting of: a data write operation to an object storage device, a set attribute operation to an object storage device, and a create object operation to an object storage device. Upon failure of the operation, the client sends a single message from the client to the manager that includes information representing a description of the failure and the data that was the subject of the operation. The data that is the subject of the operation may be user-data or parity data. In one embodiment, the distributed object-based system is a RAID system, and the data in the message is used to correct a parity equation associated with the data in the message and other data on one or more of the object storage devices.

In accordance with a further embodiment, a client attempts to perform an operation selected from the group consisting of: a data read operation from an object storage device, a data write operation to an object storage device, a set attribute operation to an object storage device, a get attribute operation from an object storage device, and a create object operation to an object storage device. Upon failure of the operation, a message is sent from the client to the manager that includes information representing a description of the failure. Thus, in contrast to existing distributed object-based systems, in the present invention the client actively participates in failure recovery.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention that together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 illustrates an exemplary network-based file storage system designed around Object-Based Secure Disks (OBDs);

FIG. 2A illustrates an exemplary data object formed of components striped across different OBDs;

FIG. 2B illustrates a state of the data object from FIG. 2A after a failure has occurred; and

FIG. 2C illustrates a state of the data object from FIG. 2B after recovery from the failure is performed using the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Reference will now be made in detail to the preferred embodiments of the present invention, examples of which are illustrated in the accompanying drawings. It is to be understood that the figures and descriptions of the present invention included herein illustrate and describe elements that are of particular relevance to the present invention, while eliminating, for purposes of clarity, other elements found in typical data storage systems or networks.

FIG. 1 illustrates an exemplary network-based file storage system 100 designed around Object-Based Secure Disks (OBDs) 20. File storage system 100 is implemented via a combination of hardware and software units and generally consists of manager software (simply, the “manager”) 10, OBDs 20, clients 30 and metadata server 40. It is noted that each manager is an application program code or software running on a corresponding server. Clients 30 may run different operating systems, and thus present an operating system-integrated file system interface. Metadata stored on server 40 may include file and directory object attributes as well as directory object contents. The term “metadata” generally refers not to the underlying data itself, but to the attributes or information that describe that data.

FIG. 1 shows a number of OBDs 20 attached to the network 50. An OBD 20 is a physical disk drive that stores data files in the network-based system 100 and may have the following properties: (1) it presents an object-oriented interface (rather than a sector-oriented interface); (2) it attaches to a network (e.g., the network 50) rather than to a data bus or a backplane (i.e., the OBDs 20 may be considered as first-class network citizens); and (3) it enforces a security model to prevent unauthorized access to data stored thereon.

The fundamental abstraction exported by an OBD 20 is that of an “object,” which may be defined as a variably-sized ordered collection of bits. Contrary to the prior art block-based storage disks, OBDs do not export a sector interface at all during normal operation. Objects on an OBD can be created, removed, written, read, appended to, etc. OBDs do not make any information about particular disk geometry visible, and implement all layout optimizations internally, utilizing higher-level information that can be provided through an OBD's direct interface with the network 50. In one embodiment, each data file and each file directory in the file system 100 are stored using one or more OBD objects. Because of object-based storage of data files, each file object may generally be read, written, opened, closed, expanded, created, deleted, moved, sorted, merged, concatenated, named, renamed, and include access limitations. Each OBD 20 communicates directly with clients 30 on the network 50, possibly through routers and/or bridges. The OBDs, clients, managers, etc., may be considered as “nodes” on the network 50. In system 100, no assumption needs to be made about the network topology except that each node should be able to contact every other node in the system. Servers (e.g., metadata servers 40) in the network 50 merely enable and facilitate data transfers between clients and OBDs, but the servers do not normally implement such transfers.

Logically speaking, various system “agents” (i.e., the managers 10, the OBDs 20 and the clients 30) are independently-operating network entities. Manager 10 may provide day-to-day services related to individual files and directories, and manager 10 may be responsible for all file- and directory-specific states. Manager 10 creates, deletes and sets attributes on entities (i.e., files or directories) on clients' behalf. Manager 10 also carries out the aggregation of OBDs for performance and fault tolerance. “Aggregate” objects are objects that use OBDs in parallel and/or in redundant configurations, yielding higher availability of data and/or higher I/O performance. Aggregation is the process of distributing a single data file or file directory over multiple OBD objects, for purposes of performance (parallel access) and/or fault tolerance (storing redundant information). In one embodiment, the aggregation scheme associated with a particular object is stored as an attribute of that object on an OBD 20. A system administrator (e.g., a human operator or software) may choose any supported aggregation scheme for a particular object. Both files and directories can be aggregated. In one embodiment, a new file or directory inherits the aggregation scheme of its immediate parent directory, by default. Manager 10 may be allowed to make layout changes for purposes of load or capacity balancing.

The manager 10 may also allow clients to perform their own I/O to aggregate objects (which allows a direct flow of data between an OBD and a client), as well as providing proxy service when needed. As noted earlier, individual files and directories in the file system 100 may be represented by unique OBD objects. Manager 10 may also determine exactly how each object will be laid out, i.e., on which OBD or OBDs that object will be stored, whether the object will be mirrored, striped, parity-protected, etc. Manager 10 may also provide an interface by which users may express minimum requirements for an object's storage (e.g., “the object must still be accessible after the failure of any one OBD”).

Each manager 10 may be a separable component in the sense that the manager 10 may be used for other file system configurations or data storage system architectures. In one embodiment, the topology for the system 100 may include a “file system layer” abstraction and a “storage system layer” abstraction. The files and directories in the system 100 may be considered to be part of the file system layer, whereas data storage functionality (involving the OBDs 20) may be considered to be part of the storage system layer. In one topological model, the file system layer may be on top of the storage system layer.

A storage access module (SAM) (not shown) is a program code module that may be compiled into managers and clients. The SAM includes an I/O execution engine that implements simple I/O, mirroring, map retrieval, striping and RAID parity algorithms discussed below. (For purposes of the present invention, the term RAID refers to any RAID level or configuration including, e.g., RAID-1, RAID-2, RAID-3, RAID-4 and RAID-5, etc.) The SAM also generates and sequences the OBD-level operations necessary to implement system-level I/O operations, for both simple and aggregate objects.

Each manager 10 maintains global parameters, notions of what other managers are operating or have failed, and provides support for up/down state transitions for other managers. A benefit of the present system is that the location information describing at what data storage device (i.e., an OBD) or devices the desired data is stored may be located at a plurality of OBDs in the network. Therefore, a client 30 need only identify one of a plurality of OBDs containing location information for the desired data to be able to access that data. The data may be returned to the client directly from the OBDs without passing through a manager.

FIG. 2A illustrates an exemplary data object formed of components A, B and C which are striped across different OBDs 20. A parity value (P) is associated with components A, B, C and stored on one of the OBDs 20. In the present invention, when a client 30 attempts to perform an operation on a data object (e.g., object 200), and the operation fails for any reason, client 30 sends a single message from the client 30 to the manager 10 that includes information representing a description of the failure and any data that was the subject of the operation. More specifically, when client 30 attempts to perform a data read operation from an OBD 20, a data write operation to an OBD 20, a set attribute operation to an OBD 20, or a get attribute operation from an OBD 20, and the attempt results in an I/O failure, the client 30 sends a single message from the client 30 to the manager 10 that includes information representing a description of the failure and any data (e.g., data object 200) that was the subject of the operation. Data that was the subject of the operation may alternatively be user-data or parity data. For certain operations such as a data read operation from an OBD 20 or a get attribute operation from OBD 20, there will be no data that was the subject of a failed operation and, in such cases, the message from the client 30 to the manager 10 may only include information representing a description of the failure.
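
By way of illustration only, the single failure message described above might be modeled as in the following minimal Python sketch. The names Operation, FailureReport and build_failure_report, and the field layout, are hypothetical and are not part of the disclosed system; the sketch assumes only that operations which attempt to store data (write, set attribute, create object) carry that data in the message, while read and get attribute operations carry a description of the failure alone.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Operation(Enum):
    READ = "read"
    WRITE = "write"
    SET_ATTRIBUTE = "set_attribute"
    GET_ATTRIBUTE = "get_attribute"
    CREATE_OBJECT = "create_object"


# Operations in which the client attempts to store data on an OBD.
DATA_BEARING = {Operation.WRITE, Operation.SET_ATTRIBUTE, Operation.CREATE_OBJECT}


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical layout of the single client-to-manager message."""
    obd_id: str                   # which OBD the failed operation targeted
    operation: Operation          # what the client was attempting
    description: str              # description of the failure
    data: Optional[bytes] = None  # user-data or parity data, if any


def build_failure_report(obd_id: str, operation: Operation,
                         description: str,
                         data: Optional[bytes] = None) -> FailureReport:
    """Build one message holding the description and, where applicable, the data."""
    if operation not in DATA_BEARING:
        data = None  # reads and get-attribute calls have no data to forward
    return FailureReport(obd_id, operation, description, data)


write_report = build_failure_report("obd-2", Operation.WRITE, "timeout", b"A'")
read_report = build_failure_report("obd-1", Operation.READ, "no response")
```

In this sketch the description and the data travel together in one record, mirroring the single-message approach described above, so that the manager does not need a second round trip to the client to obtain the data.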

In one embodiment (illustrated by the example below), where the distributed object-based storage system is a RAID storage system, the portion of the message sent from client 30 to manager 10 that includes the data that was the subject of the failed I/O operation is used by manager 10 to correct a parity equation associated with such data and other data on one or more of the object storage devices. Referring now to FIG. 2A, components A, B and C of object 200 are striped across different OBDs 20, and a parity equation value (P) associated with object 200 may be stored on a further OBD 20. When the OBDs 20 containing components A, B and C are operating in a non-degraded mode, the parity equation may be represented as equation (1) below:

P = A ⊕ B ⊕ C    (1)
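
As a worked illustration of equation (1), the following Python sketch computes a parity value as the bytewise XOR of equally sized components. The function name xor_parity and the sample byte values are assumptions chosen for illustration; the sketch is not taken from the SAM or from any particular RAID implementation.

```python
from functools import reduce


def xor_parity(*components: bytes) -> bytes:
    """Return the bytewise XOR of equally sized components."""
    if not components:
        raise ValueError("at least one component is required")
    length = len(components[0])
    if any(len(c) != length for c in components):
        raise ValueError("components must have equal length")
    # zip(*components) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*components))


a = b"\x0f\xf0"
b = b"\x33\x33"
c = b"\x55\xaa"
p = xor_parity(a, b, c)          # P = A xor B xor C
rebuilt_a = xor_parity(p, b, c)  # any one component follows from the others
```

Because XOR is its own inverse, any single component can be recomputed from the parity value and the remaining components, which is the property relied upon when one OBD is unavailable.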

Next, assume that a client 30 attempts to write new values A′, B′, C′ and P′ to segments A, B, C and P of object 200, and the writes of A′ and C′ fail (this condition is shown in FIG. 2B). The write of C′ fails because OBD1 has suffered a permanent failure, while the write of A′ fails due to a transient failure (i.e., OBD2 is actually functional). In accordance with the present invention, client 30 will, in response to the failure, return to manager 10 both a description of the failure and the data that was being written (A′ and C′). Using information from the message, and probing the OBDs reported in the error log, manager 10 can deduce that OBD1 has permanently failed, and OBD2 is functional. Moreover, manager 10 can also correct the parity equation associated with object 200 because manager 10 also possesses the data that client 30 attempted to write. The parity equation is corrected by ensuring A′ and P′ have been written to their respective OBDs (this condition is shown in FIG. 2C). Had the client not forwarded the data, such a recovery would have been impossible. The corrected parity equation is represented by equation (2) below:

P′ = A′ ⊕ B′ ⊕ C′    (2)
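
The recovery scenario above may be sketched, under stated assumptions, as follows. FakeOBD, xor_blocks and manager_recover are hypothetical names; the OBDs are simulated in memory, a "probe" is reduced to a boolean liveness check, and the component keyed "C" stands for the permanently failed OBD1 while the component keyed "A" stands for OBD2 after its transient failure has cleared. The sketch is one possible reading of the recovery steps, not the manager's actual implementation.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


class FakeOBD:
    """Hypothetical in-memory stand-in for an object storage device."""

    def __init__(self, alive: bool = True) -> None:
        self.alive = alive
        self.stored = None

    def probe(self) -> bool:
        return self.alive

    def write(self, data: bytes) -> None:
        if not self.alive:
            raise OSError("device has failed")
        self.stored = data


def manager_recover(obds: dict, reported: dict) -> list:
    """Re-write the reported data to every OBD that answers a probe.

    Returns the names of components whose OBD remains unreachable.
    """
    unreachable = []
    for name, data in reported.items():
        if obds[name].probe():
            obds[name].write(data)    # transient failure: retry succeeds
        else:
            unreachable.append(name)  # permanent failure: leave degraded
    return unreachable


# New values the client tried to write, and the matching parity P'.
a_new, b_new, c_new = b"\x01", b"\x02", b"\x04"
p_new = xor_blocks(a_new, b_new, c_new)

obds = {"A": FakeOBD(), "B": FakeOBD(), "C": FakeOBD(alive=False), "P": FakeOBD()}
obds["B"].write(b_new)  # client writes that succeeded
obds["P"].write(p_new)

# The client's message carried A' and C'; the manager probes and repairs.
still_failed = manager_recover(obds, {"A": a_new, "C": c_new})
rebuilt_c = xor_blocks(obds["A"].stored, obds["B"].stored, obds["P"].stored)
```

After recovery, A′, B′ and P′ are all stored on functional devices, so C′ can be reconstructed from equation (2) even though its OBD is permanently lost. Without the forwarded A′, the stored A would be stale relative to P′ and the reconstruction would yield incorrect data.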

Finally, it will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that this invention is not limited to the particular embodiments disclosed, but is intended to cover modifications within the spirit and scope of the present invention as defined in the appended claims.

1. In a distributed object-based storage system that includes a plurality of object storage devices for storing object components, a manager coupled to each of the object storage devices, wherein the object storage devices coordinate with the manager, and one or more clients that access and store distributed, object-based files on the object storage devices, a method for recovering from an I/O error, comprising: attempting, by a client, to perform a data storage operation wherein the client attempts to store data that is the subject of the operation on one or more of the object storage devices, the operation selected from the group consisting of: a data write operation to an object storage device, a set attribute operation to an object storage device, and an object create operation to an object storage device; and upon failure of the operation, sending a single message from the client to the manager that includes (a) information representing a description of the failure and (b) the data that the client attempted to store that was the subject of the operation.
2. The method of claim 1, wherein the data that is the subject of the operation and that the client attempted to store is user-data or parity data.
3. The method of claim 1, further comprising using the data in the message, a parity equation associated with the data in the message and other data on one or more of the object storage devices, to correct data associated with the failure.
4. The method of claim 1, wherein the distributed object-based system is a RAID system.
5. In a distributed object-based storage system that includes a plurality of object storage devices for storing object components, a manager coupled to each of the object storage devices, wherein the object storage devices coordinate with the manager, and one or more clients that access and store distributed, object-based files on the object storage devices, a system for recovering from an I/O error, comprising: a client that attempts to perform a data storage operation where the client attempts to store data that is the subject of the operation on one or more of the object storage devices, the operation selected from the group consisting of: a data write operation to an object storage device, a set attribute operation to an object storage device, and a create object operation to an object storage device; and wherein, upon failure of the operation, the client sends a single message from the client to the manager that includes (a) information representing a description of the failure and (b) the data that the client attempted to store that was the subject of the operation.
6. The system of claim 5, wherein the data that is the subject of the operation and that the client attempted to store is user-data or parity data.
7. The system of claim 5, wherein said system uses the data in the message, a parity equation associated with the data in the message and other data on one or more of the object storage devices, to correct data associated with the failure.
8. The method of claim 5, wherein the distributed object-based system is a RAID system.